Start: 09 November 2017 (16:18:28)


Importing Nutri Data

nutri_orig <- read_xlsx("data/insa_tca.xlsx") 
#glimpse(nutri_orig)

Problem to make us cry 😢

Problem: Variable names of the the original data and the risk of incorrectly tidying the data.

When we look at the nutri_long dataset we see that we could lose the ordering of the keyVars. We need to protect again this causing a problem during the data tidying. The problem stems from the original data, looking at the names of nutri_orig:

names(nutri_orig)[1:12]
##  [1] "X__1"                    "Nome do alimento"       
##  [3] "Grupo"                   "Energia [kcal] (ENERCC)"
##  [5] "X__2"                    "X__3"                   
##  [7] "Energia [kJ] (ENERCJ)"   "X__4"                   
##  [9] "X__5"                    "Lípidos (FAT)"          
## [11] "X__6"                    "X__7"

Variable Energia [kcal] (ENERCC) contains the value for nutrient “Energia [kcal] (ENERCC)”, X__2 the associated unit, and X__3 the associated quantity. It is the same for variable Lípidos (FAT) contains the value for nutrient “Lípidos (FAT)”, X__6 the associated unit, and X__7 the associated quantity. This is the same pattern for all the nutrients as it originates from the original Excel spreadsheet where each nutrients has 3 columns but one header in a merged cell above the 3 columns.

Using the column names of the the original data we will create an ordering variable. First we will remove the the first three variables as they will remain constants as columns.

The we will create a grouping variable for the nutrients value, unit and quantity called grpNtr and grpType.

ordNames <- as_tibble(list(varName = names(nutri_orig)[-c(1, 2, 3)])) %>%
  mutate(grpNtr = ceiling(row_number()/3)) %>% 
  mutate(typeVars = parse_number(varName) %% 2) %>% 
  mutate(xVars = str_detect(varName, "^X__")) %>% 
  mutate(grpType = if_else(typeVars == 1, "Quantity", "Unit")) %>% 
  mutate(grpType = if_else(xVars == FALSE, "Nutrient", grpType)) %>% 
  select(-xVars, -typeVars)
## Warning: 37 parsing failures.
## row # A tibble: 5 x 4 col     row   col expected                               actual expected   <int> <int>    <chr>                                <chr> actual 1     1    NA a number              Energia [kcal] (ENERCC) row 2     4    NA a number                Energia [kJ] (ENERCJ) col 3     7    NA a number                        Lípidos (FAT) expected 4    10    NA a number      Ácidos gordos saturados (FASAT) actual 5    13    NA a number Ácidos gordos monoinsaturados (FAMS)
## ... ................. ... ........................................................... ........ ........................................................... ...... ........................................................... ... ........................................................... ... ........................................................... ........ ........................................................... ...... ...........................................................
## See problems(...) for more details.
head(ordNames, n=12)
## # A tibble: 12 x 3
##                            varName grpNtr  grpType
##                              <chr>  <dbl>    <chr>
##  1         Energia [kcal] (ENERCC)      1 Nutrient
##  2                            X__2      1     Unit
##  3                            X__3      1 Quantity
##  4           Energia [kJ] (ENERCJ)      2 Nutrient
##  5                            X__4      2     Unit
##  6                            X__5      2 Quantity
##  7                   Lípidos (FAT)      3 Nutrient
##  8                            X__6      3     Unit
##  9                            X__7      3 Quantity
## 10 Ácidos gordos saturados (FASAT)      4 Nutrient
## 11                            X__8      4     Unit
## 12                            X__9      4 Quantity

Tidying Data Step 1

From the nutri_orig data we will create a long format

Now we will merge the group ordering to the nutri_long dataset

nutri_long <- nutri_orig %>% 
  rename(foodID = X__1, 
         foodItem = `Nome do alimento`, 
         foodGroup = Grupo) %>% 
  gather(key = "keyVars", value = "keyVals", -starts_with("food")) %>% 
  select(foodID, foodGroup, foodItem, keyVars, keyVals) %>% 
  left_join(ordNames, by= c("keyVars" = "varName")) 

glimpse(nutri_long)
## Observations: 136,407
## Variables: 7
## $ foodID    <dbl> 619, 620, 802, 803, 703, 704, 646, 346, 345, 971, 97...
## $ foodGroup <chr> "Açúcar, confeitaria e sobremesas doces à base de ág...
## $ foodItem  <chr> "\"Donut\"", "\"Donut\" recheado com doce de fruta",...
## $ keyVars   <chr> "Energia [kcal] (ENERCC)", "Energia [kcal] (ENERCC)"...
## $ keyVals   <chr> "400", "348", "878", "900", "114", "293", "11", "78"...
## $ grpNtr    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ grpType   <chr> "Nutrient", "Nutrient", "Nutrient", "Nutrient", "Nut...

Tidying Nutri Data

In the nutri_long dataset we have 136407 observations. Our final tidy dataset, which we call nutri_tidy should contain one third of these values (i.e., 45469).

# Prepare the Units and Quantity so that they can be merged to the Nutrient data.
nUnit <- nutri_long %>% 
  filter(grpType == "Unit") %>% 
  select(foodID, grpNtr, Unit = keyVals) 

head(nUnit)
## # A tibble: 6 x 3
##   foodID grpNtr         Unit
##    <dbl>  <dbl>        <chr>
## 1    619      1 quilocaloria
## 2    620      1 quilocaloria
## 3    802      1 quilocaloria
## 4    803      1 quilocaloria
## 5    703      1 quilocaloria
## 6    704      1 quilocaloria
nQty <-  nutri_long %>% 
  filter(grpType == "Quantity") %>% 
  select(foodID, grpNtr, Quantity = keyVals) 

head(nQty)
## # A tibble: 6 x 3
##   foodID grpNtr                  Quantity
##    <dbl>  <dbl>                     <chr>
## 1    619      1 por 100 g de parte edível
## 2    620      1 por 100 g de parte edível
## 3    802      1 por 100 g de parte edível
## 4    803      1 por 100 g de parte edível
## 5    703      1 por 100 g de parte edível
## 6    704      1 por 100 g de parte edível
# Prepare the base for the Nutrition data by keeping only the nutrient and values
nutri_tidy <- nutri_long %>% 
  filter(grpType == "Nutrient") %>% 
  rename(Value = keyVals,
         Nutrient = keyVars) %>% 
  left_join(nUnit, by = c("foodID", "grpNtr")) %>% 
  left_join(nQty, by = c("foodID", "grpNtr")) %>% 
  select(-grpNtr, -grpType) %>% 
  arrange(foodID, Nutrient )

glimpse(nutri_tidy)
## Observations: 45,469
## Variables: 7
## $ foodID    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ foodGroup <chr> "Leite e produtos lácteos", "Leite e produtos lácteo...
## $ foodItem  <chr> "Leite de cabra cru", "Leite de cabra cru", "Leite d...
## $ Nutrient  <chr> "a-tocoferol (TOCPHA)", "Ácido linoleico (F18:2CN6)"...
## $ Value     <chr> "0.03", "0.1", "1.1", "0.1", "2.6", "0.1", "0", "86....
## $ Unit      <chr> "milligrama", "grama", "grama", "grama", "grama", "g...
## $ Quantity  <chr> "por 100 g de parte edível", "por 100 g de parte edí...

Cleaning Data

Data is tidy but not totally cleaned yet

nutri_tidy %>% arrange(foodGroup, foodItem) %>% select(foodItem)
## # A tibble: 45,469 x 1
##       foodItem
##          <chr>
##  1 "\"Donut\""
##  2 "\"Donut\""
##  3 "\"Donut\""
##  4 "\"Donut\""
##  5 "\"Donut\""
##  6 "\"Donut\""
##  7 "\"Donut\""
##  8 "\"Donut\""
##  9 "\"Donut\""
## 10 "\"Donut\""
## # ... with 45,459 more rows

We use stringr to clean up the values from foodItem where we can still find quotation marks (“”)

nutri_clean <- nutri_tidy %>%
        # clean observations
        mutate(foodItem = str_replace_all(
                foodItem, "\\p{quotation mark}", ""),
               Nutrient = str_replace_all(
                       Nutrient, "\\+", "")
               ) %>% 
        arrange(foodGroup, foodItem)
DT::datatable(nutri_clean)
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html

Now when we search for “Donut” on the search window we no longer see it with quotation marks.

When we glimpse() the data we still see that the variable types are not accordingly

glimpse(nutri_clean)
## Observations: 45,469
## Variables: 7
## $ foodID    <dbl> 804, 804, 804, 804, 804, 804, 804, 804, 804, 804, 80...
## $ foodGroup <chr> "Açúcar, confeitaria e sobremesas doces à base de ág...
## $ foodItem  <chr> "Açúcar amarelo", "Açúcar amarelo", "Açúcar amarelo"...
## $ Nutrient  <chr> "a-tocoferol (TOCPHA)", "Ácido linoleico (F18:2CN6)"...
## $ Value     <chr> "0", "0", "0", "0", "0", "0", "0", "2", "0", "0", "4...
## $ Unit      <chr> "milligrama", "grama", "grama", "grama", "grama", "g...
## $ Quantity  <chr> "por 100 g de parte edível", "por 100 g de parte edí...

Convert variable types to their actual type

nutri_clean <- readr::type_convert(nutri_clean)
## Parsed with column specification:
## cols(
##   foodGroup = col_character(),
##   foodItem = col_character(),
##   Nutrient = col_character(),
##   Value = col_double(),
##   Unit = col_character(),
##   Quantity = col_character()
## )
glimpse(nutri_clean)
## Observations: 45,469
## Variables: 7
## $ foodID    <dbl> 804, 804, 804, 804, 804, 804, 804, 804, 804, 804, 80...
## $ foodGroup <chr> "Açúcar, confeitaria e sobremesas doces à base de ág...
## $ foodItem  <chr> "Açúcar amarelo", "Açúcar amarelo", "Açúcar amarelo"...
## $ Nutrient  <chr> "a-tocoferol (TOCPHA)", "Ácido linoleico (F18:2CN6)"...
## $ Value     <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 2.00, 0.00...
## $ Unit      <chr> "milligrama", "grama", "grama", "grama", "grama", "g...
## $ Quantity  <chr> "por 100 g de parte edível", "por 100 g de parte edí...

End: 09 November 2017 (16:18:33)